Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA

Authors

Roland E. Wunderlich

Markus Püschel

James C. Hoe

Abstract

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features, and instead use in-order execution, configurable memory hierarchies, and programmable I/O interfaces. We investigate the methods needed to achieve high-performance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices larger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.
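The overlap of computation with block transfers described in the abstract is commonly structured as double buffering. The following is a minimal sketch of that general pattern, not the authors' implementation and not the C6713 DMA API. It assumes square row-major matrices whose size is a multiple of a hypothetical block size `NB`, and it uses a synchronous `memcpy` as a stand-in for the DMA engine, so no real overlap occurs here. The names `fetch_block`, `block_kernel`, and `mmm_double_buffered` are illustrative only.

```c
#include <assert.h>
#include <string.h>

#define NB 4 /* hypothetical block size; a real choice depends on on-chip memory */

/* Stand-in for a DMA block copy: gathers the NB x NB block (bi, bj) of a
 * row-major n x n matrix into a contiguous buffer. This version is
 * synchronous; a DMA engine would perform the copy in the background. */
static void fetch_block(float *dst, const float *src, int n, int bi, int bj)
{
    for (int r = 0; r < NB; r++)
        memcpy(dst + r * NB, src + (bi * NB + r) * n + bj * NB,
               NB * sizeof(float));
}

/* c += a * b on contiguous NB x NB buffers. */
static void block_kernel(float *c, const float *a, const float *b)
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i * NB + j] += a[i * NB + k] * b[k * NB + j];
}

/* Computes C = A * B using two buffer sets per operand: while the kernel
 * works on set `cur`, the next pair of blocks is placed in set `nxt`. */
void mmm_double_buffered(float *C, const float *A, const float *B, int n)
{
    static float abuf[2][NB * NB], bbuf[2][NB * NB], cbuf[NB * NB];
    assert(n > 0 && n % NB == 0);
    int nb = n / NB;

    for (int bi = 0; bi < nb; bi++) {
        for (int bj = 0; bj < nb; bj++) {
            int cur = 0;
            memset(cbuf, 0, sizeof cbuf);
            fetch_block(abuf[cur], A, n, bi, 0);
            fetch_block(bbuf[cur], B, n, 0, bj);

            for (int bk = 0; bk < nb; bk++) {
                int nxt = cur ^ 1;
                if (bk + 1 < nb) {
                    /* With real DMA, these transfers would be started here
                     * and run while block_kernel executes below. */
                    fetch_block(abuf[nxt], A, n, bi, bk + 1);
                    fetch_block(bbuf[nxt], B, n, bk + 1, bj);
                }
                block_kernel(cbuf, abuf[cur], bbuf[cur]);
                /* With real DMA, wait for transfer completion here. */
                cur = nxt;
            }

            /* Write the finished block of C back to main memory. */
            for (int r = 0; r < NB; r++)
                memcpy(C + (bi * NB + r) * n + bj * NB, cbuf + r * NB,
                       NB * sizeof(float));
        }
    }
}
```

On hardware with an asynchronous DMA engine, the two `fetch_block` calls inside the inner loop would become transfer submissions, and the comment after `block_kernel` marks where a completion wait would go. The kernel only ever reads the `cur` buffers, so the `nxt` buffers are free to be filled concurrently.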

Similar resources

Optimizing Matrix-Matrix Multiplication for an Embedded VLIW Processor

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on conventional general-purpose processors like the Intel Pentium 4. Fast algorithms, such as those in the Goto and ATLAS BLAS libraries, exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. However, the microarchitectur...

Space-Time Tradeoffs in Memory Hierarchies

The speed of CPUs is accelerating rapidly, outstripping that of peripheral storage devices and making it increasingly difficult to keep CPUs busy. Multilevel memory hierarchies, scaled to simulate single-level memories, are increasing in importance. In this paper we introduce the Memory Hierarchy Game, a multi-level pebble game simulating data movement in memory hierarchies for straight-line comp...

High Performance Computing with the Cell Broadband Engine

The Cell Broadband Engine was conceived to enable the design of novel and highly efficient systems for compute-intensive applications. The Cell/B.E. departs from prior architectures by adopting a heterogeneous chip multiprocessor architecture with novel accelerator cores and an explicitly managed memory hierarchy. The increased computing density of the design improves peak performance as well a...

Optimized Dense Matrix Multiplication on a Many-Core Architecture

Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of manycore-on-a-chip systems with a software managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures. In this pape...

Data prefetching for linear algebra operations on high performance workstations

In previous work it was shown that the performance of linear algebra computations, which access large amounts of data, depends on the behavior of the memory hierarchy. This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations. The performance of the dense matrix...

Publication date: 2005